Data Preprocessing

  1. First, I wanted to investigate which variables are time dependent and also exclude some that were clearly unnecessary (i.e., “SITE”,“COLPROT”,“ORIGPROT”, “FLDSTRENG”,“FSVERSION”,“IMAGEUID”, “Month_bl”,“Month”,“M”,“update_stamp”).

  2. Merge time dependent and independent variables into the long_dat data frame. Also, I recoded the time points in the VISCODE variable into integers.

long_dat <- dat[, c(ivars[,1], nivars[,1])] %>%
  mutate(VISCODE = match(VISCODE, c("bl", "m03", "m06", "m12", "m18", "m24", 
                                    "m30","m36", "m42", "m48", "m54", "m60", 
                                    "m66", "m72","m78", "m84", "m90", "m96", 
                                    "m102", "m108","m114", "m120", "m126", 
                                    "m132", "m144", "m156"))-1) %>%
  relocate(RID, PTID, VISCODE) %>%
  arrange(RID, VISCODE)
  1. In the original data frame there were quite some _bl or _BL variables. Thus, I wanted to check whether these columns had already been integrated or not at each corresponding time point for each participant. Surprise, the test was negative.

  2. Therefore, I continued with merging the _bl/_BL variables with the corresponding time dependent variable for each participant. Additionally, I specified the data type of each variable individually for optimal control and oversight over the data structure.

  3. Transform Long to Wide Data Format

## # A tibble: 6 × 1,153
##   RID   PTID         AGE PTGENDER PTEDUCAT PTETHCAT PTRACCAT PTMARRY APOE4 FDG_0
##   <fct> <chr>      <dbl> <fct>       <int> <fct>    <fct>    <fct>   <int> <dbl>
## 1 2     011_S_0002  74.3 Male           16 Not His… White    Married     0  1.37
## 2 3     011_S_0003  81.3 Male           18 Not His… White    Married     1  1.08
## 3 4     022_S_0004  67.5 Male           10 Hisp/La… White    Married     0 NA   
## 4 5     011_S_0005  73.7 Male           16 Not His… White    Married     0  1.29
## 5 6     100_S_0006  80.4 Female         13 Not His… White    Married     0 NA   
## 6 7     022_S_0007  75.4 Male           10 Hisp/La… More th… Married     1 NA   
## # ℹ 1,143 more variables: FDG_2 <dbl>, FDG_7 <dbl>, FDG_11 <dbl>, FDG_12 <dbl>,
## #   FDG_13 <dbl>, FDG_14 <dbl>, FDG_15 <dbl>, FDG_16 <dbl>, FDG_17 <dbl>,
## #   FDG_18 <dbl>, FDG_19 <dbl>, FDG_21 <dbl>, FDG_22 <dbl>, FDG_23 <dbl>,
## #   FDG_24 <dbl>, FDG_3 <dbl>, FDG_4 <dbl>, FDG_5 <dbl>, FDG_6 <dbl>,
## #   FDG_9 <dbl>, FDG_8 <dbl>, FDG_10 <dbl>, FDG_25 <dbl>, FDG_20 <dbl>,
## #   FDG_1 <dbl>, PIB_0 <dbl>, PIB_2 <dbl>, PIB_7 <dbl>, PIB_11 <dbl>,
## #   PIB_12 <dbl>, PIB_13 <dbl>, PIB_14 <dbl>, PIB_15 <dbl>, PIB_16 <dbl>, …

Attrition Analysis

Based on the number of participants measured at any time point I made a frequency plot to get a first idea of the sampling frequency.

Domains

Demographics

Cognitive Tests

Biomedical Imaging

Biomarkers

Based on these findings it appears that time point 9 is a cut-off where the number of measurements drop quite strongly. Time point 9 corresponds to month 42 (i.e., 3.5 years) of the follow-up.

Polygenic Risk Score for Educational attainment

The merge(by.x, by.y) function creates a new data frame that only keeps those rows for which there is a matching key (in our case PTID). Therefore, we do have genetic data from 2 additional individuals for which we do not have any other measurements. The final data frame for which testing data and genetic data is available is thus, 1408 (N).

Plot PGS EA vs. Actual EA

Based on this plot, we can see a positive relationship between the polygenic score for education attainment and actual years of education. This means that with a higher PGS score comes higher genetic capacity for educational attainment.

We ran Pearson’s correlation which resulted in r = 0.286 (p-value < 2.2e-16)

Create Residuals

To get the residual we regressed the polygenic risk score for educational attainment against actual EA including the variables SEX & AGE as covariates. The results are depicted in the density plot.

How to interpret the Residuals?

It is important to correctly interpret the residual scores. The correct way to interpret them is, that a high residual score means that the individual has over-performed relative to his or her genetic capacity. See for example in this table for a short proof:

##   Actual Predicted Residuals
## 1     18  16.91911  1.080886
## 2     18  16.91911  1.080886
## 3     18  16.91911  1.080886
## 4     18  16.91911  1.080886
## 5     18  16.91911  1.080886
## 6     18  16.91911  1.080886

It is interesting to see that the residual plot is not normally distributed. Does this suggest that we should continue using non-parametric analysis techniques?

Survival Analysis

Using the ntile function from dplyr, the lower thirtile will be assigned value 1 (~ negative residual), middle thirtile value 2 and upper thirtile value 3 (~positive residual).

Mini-Mental State Examination (MMSE)

## [1] "Log Rank Test p-value:" "3.6893601014735e-26"

Alzheimer’s Disease Assessment Scale

ADAS11

## [1] "Log Rank Test p-value:" "8.47887296650944e-08"

### ADAS13

## [1] "Log Rank Test p-value:" "3.40154805596402e-06"

### ADASQ4

## Warning in pchisq(chi, df, lower.tail = FALSE): NaNs produced
## [1] "Log Rank Test p-value:" "NaN"

CDRSB

## [1] "Log Rank Test p-value:" "1.84990010368541e-13"

DIGITSCORE

## [1] "Log Rank Test p-value:" "1.94887935995934e-07"

ECOGPTDIVAT

## [1] "Log Rank Test p-value:" "0.789248619846041"

ECOGSPVISPAT

## [1] "Log Rank Test p-value:" "0.00886077736698255"

FAQ

## [1] "Log Rank Test p-value:" "1.7859427627785e-05"

LDELTOTAL

## [1] "Log Rank Test p-value:" "1.69990660816756e-24"

MOCA

## Warning in pchisq(chi, df, lower.tail = FALSE): NaNs produced
## [1] "Log Rank Test p-value:" "NaN"

RAVLT_Forgetting

## [1] "Log Rank Test p-value:" "0.0551585056141432"

RAVLT_Immediate

## [1] "Log Rank Test p-value:" "8.10150278996432e-28"

RAVLT_learning

## [1] "Log Rank Test p-value:" "4.05912989451853e-07"

RAVLT_forgetting

## [1] "Log Rank Test p-value:" "0.0551585056141432"

TRABSCORE

## [1] "Log Rank Test p-value:" "6.82902375980929e-21"